My brother and a couple of his friends provide movie reviews on the site Criticker.com. I thought it would be fun to provide them with a text analysis of their movie reviews. What follows is that analysis.
The workflow for this project starts with an exploration of the data, looking at single words and bigrams, and ends with sentence-level sentiment analysis.
Since the readers this post was created for are not R users, the code has been left out. The code for this analysis can be found on my Github page.
You can familiarize yourself with the reviews at the three reviewer pages below, and I recommend a quick read if you enjoy movies!
The dataset is simple: an XML export for each user. After cleaning, there are only four columns:
filmname: Name of the movie
quote: The movie review
reviewer: The first name of the reviewer
rating: The score each person attributes to the movie
The following steps were taken to preprocess the data.
First, I only cared about movies that actually had a review, so any movies that had just a score were removed.
Second, the movie reviews were tokenized. Tokenization means taking a review and breaking it out into one word per row. This makes the dataset significantly longer.
Finally, a significant number of words in the English language are not useful for sentiment analysis or prediction, for example ‘and’ and ‘the’. These words are known as stop words and have been removed from the dataset.
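The actual analysis was done in R, but for anyone curious, the preprocessing steps above can be sketched in Python. The column names match the dataset described earlier; the stop-word list here is a tiny illustrative subset, not the full list used in the analysis.

```python
import re

# A tiny illustrative stop-word list; the real analysis uses a much longer one
STOP_WORDS = {"and", "the", "a", "an", "of", "it", "was", "is", "this"}

def tokenize(review: str) -> list[str]:
    """Lowercase a review and split it into one word per element."""
    return re.findall(r"[a-z']+", review.lower())

def preprocess(rows):
    """Drop score-only rows, tokenize, and remove stop words.

    Each input row is (filmname, quote, reviewer, rating); the output is
    one (filmname, reviewer, rating, word) row per non-stop word, which is
    why tokenization makes the dataset significantly longer.
    """
    out = []
    for filmname, quote, reviewer, rating in rows:
        if not quote:                      # score-only rows have no review text
            continue
        for word in tokenize(quote):
            if word not in STOP_WORDS:
                out.append((filmname, reviewer, rating, word))
    return out

rows = [
    ("Alien", "The tension and the pacing of this movie is great", "Zach", 92),
    ("Gigli", None, "Zach", 10),           # score only: removed
]
tokens = preprocess(rows)
```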
A good place to start with any analysis is with an exploration of the data.
First, a quick look at how many reviews have been created and how many total words have been written for each reviewer.
**Review Summary: Summary Statistics for Each Reviewer**

| Reviewer | Review Count | Total Word Count | Word Count excl. Stop Words | Avg Word Count per Review | Shortest Review | Longest Review | % of Stop Words |
|---|---|---|---|---|---|---|---|
| Justin | 1,261 | 85,100 | 29,576 | 68 | 1 | 106 | 65.0% |
| Tyler | 2,092 | 140,954 | 43,729 | 67 | 1 | 107 | 69.0% |
| Zach | 604 | 37,305 | 16,022 | 62 | 1 | 100 | 57.0% |
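The summary columns above are all straightforward counts. For anyone curious how they come together (the original code is R, on my Github page), here is a Python sketch using a toy review set and an illustrative stop-word subset:

```python
# Sketch of the summary-table computation using toy reviews; the stop-word
# list is an illustrative subset of the real one.
STOP_WORDS = {"and", "the", "a", "of", "is", "it"}

reviews = {
    "Justin": ["the movie is great", "bad"],
}

def summarize(texts):
    counts = [len(t.split()) for t in texts]
    total = sum(counts)
    non_stop = sum(1 for t in texts for w in t.split() if w not in STOP_WORDS)
    return {
        "review_count": len(texts),
        "total_words": total,
        "words_excl_stop": non_stop,
        "avg_words": total // len(texts),
        "shortest": min(counts),
        "longest": max(counts),
        "pct_stop": round(100 * (total - non_stop) / total),
    }

stats = {name: summarize(texts) for name, texts in reviews.items()}
```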
Some observations from the summary statistics: Zach has the fewest total reviews while Tyler has the most. All three reviewers have similar average word counts per review. All three have a one-word review, and the longest reviews for each are all around the same word count. Finally, Zach has the lowest percentage of stop words among the three, leading me to believe his reviews may be more concise than the other two reviewers’.
Second, I will take a look at the most common words from all the reviewers combined.
Looking at the image above, the top word used across all three reviewers is ‘movie’, which is not a big surprise. Furthermore, none of the words on the list look out of place for movie reviewers.
You will notice that “movies” is also a popular word. I could have stemmed the dataset to prevent that sort of thing, but I wanted the three reviewers to see the words as they were written. In case anyone does not know, stemming is the process of reducing an inflected word to its stem by stripping affixes such as suffixes and prefixes. An example: “fishing”, “fished” and “fisher” might all be reduced to the word “fish” after stemming.
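Real analyses use an off-the-shelf stemmer (Porter or Snowball, which apply ordered rules with conditions), but a naive suffix-stripping sketch is enough to illustrate the idea:

```python
def naive_stem(word: str) -> str:
    """Toy stemmer: strip a few common suffixes.

    Real stemmers (Porter, Snowball) are far more careful; this exists
    only to show how inflected forms collapse to a shared stem.
    """
    for suffix in ("ing", "ed", "er", "s"):
        # require at least 3 characters of stem so short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ("fishing", "fished", "fisher", "movies")]
```

With this toy rule set, “fishing”, “fished” and “fisher” all collapse to “fish”, and “movies” collapses to “movie”, which is exactly the duplication stemming would have removed from the top-ten list.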
Now that we know the top ten words overall, let’s look at the top ten words for each reviewer.
Generally speaking, Tyler and Justin use pretty similar words. What is interesting to me is that Zach actually uses ‘film’ more often than ‘movie’, the opposite of Tyler and Justin. Also, Zach likes to talk about scenes more often than either Tyler or Justin.
Previously we only looked at the top 10 words for each reviewer. Below you will find a visual representation (wordcloud) of all words for each reviewer. To read the wordcloud, the largest text is the word appearing most frequently in the reviews. Below are the wordclouds for Justin, Tyler and Zach.
After looking at the top ten words and all the words for each reviewer, we can visually look at the frequencies of all the words by a reviewer compared to another reviewer.
Above, the frequencies of each word are plotted for one reviewer against another. Words along the dotted line have similar frequencies between the two reviewers. So, in the left panel, both Tyler and Justin use “movies”, “bad”, and “character” with similar frequencies. Words farther from the line appear more often in one reviewer’s text than the other’s. For example, in the right panel, the words “disney” and “viewer” appear more frequently in Zach’s reviews than in Tyler’s, while Tyler uses “pretty” and “nice” more often.
After looking at the counts and frequencies of single words, it is time to look at the relationship between words. In the first part of this section we will look at n-grams. N-grams are sets of adjacent words: bigrams are two adjacent words in the text, and trigrams are three.
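Extracting n-grams is just a sliding window over the tokenized text. A minimal Python sketch (the example sentences are made up for illustration):

```python
from collections import Counter

def ngrams(words, n=2):
    """Return the list of n adjacent-word tuples in a tokenized review."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "robert de niro steals every scene".split()
bigrams = ngrams(tokens, 2)      # five overlapping word pairs
trigrams = ngrams(tokens, 3)     # four overlapping word triples

# Counting bigrams across reviews gives the "top bigrams" charts below
top = Counter(ngrams("sci fi fans love sci fi".split())).most_common(1)
```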
Below are the top bigrams for each reviewer.
Now it is starting to get interesting. I would guess that all three reviewers prefer sci-fi movies, because it is the most common bigram in each of their datasets. Zach appears to really like the actor Robert Deniro, since his name lands in the top ten bigrams. Tyler and Justin also appear to watch quite a few horror movies, but Tyler appears to watch more action movies than horror movies. Zach discusses more of the meta parts of films, including special effects, voice acting and physical comedy, and has a place in his heart for movies with love stories.
As with individual words, we also want to take a look at not just the top ten bigrams, but also the relationship between all bigrams. We will do this using a network graph. A network graph shows the connections between nodes, which are words in this case. Network graphs are really popular in social media data analytics.
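The plots themselves were drawn with an R graph library, but the underlying structure is just a weighted edge list built from bigram counts. A plotting-free Python sketch (the example phrases are invented):

```python
from collections import Counter, defaultdict

def bigram_edges(reviews, min_count=1):
    """Count bigrams across reviews and return them as a weighted graph.

    Nodes are words; an edge a -> b with weight w means the bigram (a, b)
    appeared w times. This is the data a network graph visualizes.
    """
    counts = Counter()
    for review in reviews:
        words = review.split()
        counts.update(zip(words, words[1:]))
    graph = defaultdict(dict)      # word -> {neighbor: weight}
    for (a, b), w in counts.items():
        if w >= min_count:
            graph[a][b] = w
    return graph

graph = bigram_edges([
    "totally worth watching",
    "totally worth checking",
])
```

A node like ‘worth’ with several outgoing edges is exactly the kind of cluster the graphs below show.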
We will look at each reviewer separately, starting with Justin. Note that the shapes of the graphs do not matter; the layouts are randomly generated each time the code is run, so no meaning should be derived from them.
A quick glance at the network graph shows some normal connections, for example around ‘movie’. It is connected to typical movie-related words like ‘kids’, ‘horror’, and ‘comic’. He also appears to like phrases such as ‘totally worth watching’ or ‘totally worth checking’. It also appears that he likes to talk about actors, because they appear frequently: ‘Tom Cruise’, ‘Liam Neeson’, and ‘Martin Lawrence’. I’m also trying to figure out why he talks about ‘pro wrestling’ so much.
Below is the network graph of Tyler’s bigrams. Like Justin, Tyler likes to mention actor names, such as ‘Tom Cruise’. The most interesting part (to me) is the cluster around ‘movie(s)’. Tyler uses the word ‘movie’ quite extensively, not just to talk about genre but also to express his feelings about a movie: ‘forgettable’, ‘pretty’, and ‘funniest’.
Finally, we move on to Zach’s bigrams. The first thing you probably noticed is that Zach’s network graph is quite a bit sparser than the others. That is because he has the lowest review count, so there are fewer adjacent word pairs. Zach’s reviews frequently discuss the characters, whether they were main or supporting and whether they were memorable. He also apparently likes true love stories.
Now that we have taken a look at the individual and adjacent words, it is time to look at the sentiment of the movie reviews. We are not going to look at the sentiment of individual words because it is a bit too primitive and the English language is syntactically complex and lexically rich.
The algorithm I will use is a bit better than sentiment by word. It uses “valence shifters” that help adjust the sentiment score. For example, if you do sentiment analysis on the single word “happy”, the score is positive. Obviously though, if the phrase is “not happy” it is no longer positive but a single word sentiment analysis would not pick that up. The valence shifters will help adjust “not happy” to negative or at least less positive. Also, the sentiment score that will be returned will be an aggregate of each sentence for each review and movie, that way we only look at a single score per reviewer and movie. A score > 0 implies an overall positive sentiment.
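The actual scoring was done with an R sentiment package, but the core idea of a valence shifter fits in a few lines of Python. The word lists below are tiny illustrative stand-ins for the large lexicons a real package uses:

```python
# Tiny illustrative lexicons; real sentiment packages use large ones.
POLARITY = {"happy": 1.0, "great": 1.0, "bad": -1.0, "boring": -1.0}
NEGATORS = {"not", "never", "hardly"}

def sentence_sentiment(sentence: str) -> float:
    """Score a sentence, flipping a word's polarity if a negator precedes it."""
    words = sentence.lower().split()
    score = 0.0
    for i, word in enumerate(words):
        if word in POLARITY:
            polarity = POLARITY[word]
            if i > 0 and words[i - 1] in NEGATORS:
                polarity = -polarity      # valence shifter: "not happy" < 0
            score += polarity
    return score

def review_sentiment(sentences) -> float:
    """Aggregate to one score per review: the mean of its sentence scores."""
    scores = [sentence_sentiment(s) for s in sentences]
    return sum(scores) / len(scores)
```

Word-by-word scoring would rate “not happy” as positive; the negator check is what flips it, and the mean over sentences is what collapses each review to the single score per reviewer and movie used below.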
First, here are the distributions of the scores by reviewer.
Justin’s distribution of sentiment seems to cluster close to 0 meaning his reviews are a bit more neutral. Tyler and Zach have a heavier tail on the right, with both distributions looking pretty close to a normal distribution.
All three skew to the right, and Tyler and Zach tend toward higher average positive scores for movies.
Now, let’s take a look at how well the average sentiment scores relate to the actual ratings.
Reviewing the three scatterplots above, there definitely appears to be a positive relationship between rating and average sentiment: as ratings increase, the average sentiment also increases. However, there also seems to be quite a lot of variability in the average sentiment at each rating. I suspect part of the variability could be explained by the fact that the lexicon used for sentiment scoring was not trained on movie review data. For example, I am certain the word “plot” appears often in the reviews. In movie reviews “plot” is a neutral talking point, but elsewhere it can be negative, as in someone plotting to do something bad.
There are two more avenues I want to explore with this dataset. One is some classification modeling, to see if I can classify reviews into the labels each of the reviewers has provided for me.
The final project is that I want to take a single export of Criticker reviews and plug it into a Shiny app that provides some analysis of the reviews. I have not fleshed out exactly what this app will do yet, but some ideas off the top of my head are analyzing the ratings by things like genre and time period, as well as some more sentiment analysis.